Given a finite family of functions, the goal of model selection aggregation is to construct a procedure that mimics the function from this family that is the closest to an unknown regression function. More precisely, we consider a general regression model with fixed design and measure the distance between functions by the mean squared error at the design points. While procedures based on exponential weights are known to solve the problem of model selection aggregation in expectation, they are, surprisingly, sub-optimal in deviation. We propose a new formulation called Q-aggregation that addresses this limitation; namely, its solution leads to sharp oracle inequalities that are optimal in a minimax sense. Moreover, based on the new formulation, we design greedy Q-aggregation procedures that produce sparse aggregation models achieving the optimal rate. The convergence and performance of these greedy procedures are illustrated and compared with other standard methods on simulated examples.

1. Introduction. Model selection is one of the major aspects of statistical learning and, as such, has received considerable attention over the past decades. More recently, the seminal works of Nemirovski (2000) and Tsybakov (2003) have introduced an idealized setup to study the properties of model selection procedures independently of the models themselves. We consider this so-called pure model selection aggregation (or simply MS aggregation) framework for the simple model of Gaussian regression with fixed design.

D. DAI, P. RIGOLLET AND T. ZHANG

Let $x_1, \dots, x_n$ be $n$ given design points in a space $\mathcal{X}$, and let $\mathcal{H} = \{f_1, \dots, f_M\}$ be a given dictionary of real-valued functions on $\mathcal{X}$. The goal is to estimate an unknown regression function $\eta : \mathcal{X} \to \mathbb{R}$ at the design points based on observations

$$Y_i = \eta(x_i) + \xi_i,$$

where $\xi_1, \dots, \xi_n$ are i.i.d. $\mathcal{N}(0, \sigma^2)$. Our main results are actually stated for sub-Gaussian random variables, but since most of the literature is available only for Gaussian noise, we temporarily make this assumption to ease comparisons throughout the Introduction. The performance of an estimator $\hat\eta$ is measured by its mean squared error (MSE) defined by

$$\mathrm{MSE}(\hat\eta) = \frac{1}{n} \sum_{i=1}^{n} \big(\hat\eta(x_i) - \eta(x_i)\big)^2.$$



In the pure model selection aggregation framework, the goal is to build an estimator $\hat\eta$ that mimics the function $f_j$ in the dictionary with the smallest MSE. Formally, a good estimator $\hat\eta$ should satisfy the following oracle inequality in a certain probabilistic sense:

$$\mathrm{MSE}(\hat\eta) \le \min_{j=1,\dots,M} \mathrm{MSE}(f_j) + \Delta_{n,M}(\sigma^2), \tag{1.1}$$

where the remainder term $\Delta_{n,M} > 0$ should be as small as possible. Note that oracle inequality (1.1) is a truly finite sample result, and the remainder term should show the interplay between the three fundamental parameters of the problem: the "dimension" $M$, the sample size $n$ and the noise level $\sigma^2$. Most oracle inequalities for model selection aggregation have been produced in expectation [see the references in Rigollet and Tsybakov (2012)], with the notable exceptions of Audibert (2008), Lecué and Mendelson (2009), Gaïffas and Lecué (2011), Dai and Zhang (2011) and Rigollet (2012), who produced oracle inequalities that hold with high probability and to which we will come back later.

From the early days of the pure model selection problem, it has been established [see, e.g., Tsybakov (2003), Rigollet (2012)] that the smallest possible order for $\Delta_{n,M}(\sigma^2)$ is $\sigma^2 (\log M)/n$ for oracle inequalities in expectation, where "smallest possible" is understood in the following minimax sense. There exists a dictionary $\mathcal{H} = \{f_1, \dots, f_M\}$ such that the following holds. For any estimator $\hat\eta$, there exists a regression function $\eta$ such that

$$\mathbb{E}\,\mathrm{MSE}(\hat\eta) \ge \min_{j=1,\dots,M} \mathrm{MSE}(f_j) + C \sigma^2 \frac{\log M}{n}$$

for some positive constant $C$. Moreover, it follows from the same results that this lower bound holds not only in expectation but also with positive probability.

The established terminology model selection is somewhat misleading. Indeed, while the goal is to mimic the best model in the dictionary $\mathcal{H}$, it has been shown [see Rigollet and Tsybakov (2012), Theorem 2.1] that there exists a dictionary $\mathcal{H}$ such that any estimator (selector) $\hat\eta$ restricted to be one of the elements of $\mathcal{H}$ cannot satisfy an oracle inequality such as (1.1) with a remainder term of order smaller than $\sigma\sqrt{(\log M)/n}$, which is clearly suboptimal. Rather than model selection, model averaging has been successfully employed to derive oracle inequalities in expectation such as (1.1). More precisely, model averaging consists in choosing $\hat\eta$ as a convex combination of the $f_j$'s with carefully chosen weights. Let $\Lambda$ be the flat simplex of $\mathbb{R}^M$ defined by

$$\Lambda = \Big\{ \lambda = (\lambda_1, \dots, \lambda_M)^\top \in \mathbb{R}^M : \lambda_j \ge 0,\ \sum_{j=1}^{M} \lambda_j = 1 \Big\}.$$

Each $\lambda \in \Lambda$ yields an aggregate estimator $\hat\eta = f_\lambda$, where

$$f_\lambda = \sum_{j=1}^{M} \lambda_j f_j.$$
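As a rough numerical illustration of the gap between the two remainder orders discussed above, the following sketch evaluates the optimal order $\sigma^2(\log M)/n$ against the order $\sigma\sqrt{(\log M)/n}$ that no selector can beat. The function names and the sample values of $M$, $n$ and $\sigma$ are our own choices for illustration, and all constants are ignored.

```python
import math

def fast_rate(M: int, n: int, sigma: float = 1.0) -> float:
    """Optimal remainder order for MS aggregation: sigma^2 * log(M) / n."""
    return sigma ** 2 * math.log(M) / n

def slow_rate(M: int, n: int, sigma: float = 1.0) -> float:
    """Order that a pure selector cannot improve on: sigma * sqrt(log(M) / n)."""
    return sigma * math.sqrt(math.log(M) / n)

if __name__ == "__main__":
    M = 50  # illustrative dictionary size (assumption)
    for n in (100, 1000, 10000):
        print(n, fast_rate(M, n), slow_rate(M, n))
```

With $\sigma = 1$ the fast rate is exactly the square of the slow rate, so the gap widens as $n$ grows.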

This is why we refer to this problem as model selection aggregation or MS aggregation. The early papers of Catoni (1999) and Yang (1999) introduced and proved optimal theoretical guarantees for a model averaging estimator called progressive mixture that was later studied in Audibert (2008) and Juditsky, Rigollet and Tsybakov (2008) from various perspectives. This estimator is based on exponential weights which, since then, have been predominantly used and have led to optimal oracle inequalities in expectation. Let $\pi = (\pi_1, \dots, \pi_M)^\top \in \Lambda$ be a given prior and $\beta > 0$ be a temperature parameter; then the $j$th exponential weight is given by

$$\lambda_j^{\mathrm{EXP}} \propto \pi_j \exp\big(-n\,\widehat{\mathrm{MSE}}(f_j)/\beta\big), \tag{1.2}$$

where

$$\widehat{\mathrm{MSE}}(f_j) = \frac{1}{n} \sum_{i=1}^{n} \big(Y_i - f_j(x_i)\big)^2.$$

The most common prior choice is the uniform prior $\pi = (1/M, \dots, 1/M)^\top$, but other choices that put more or less weight on different functions of the dictionary have been successfully applied to various related problems; see, for example, Dalalyan and Salmon (2011), Rigollet and Tsybakov (2011, 2012). Note that progressive mixture contains an extra averaging step which is irrelevant to the fixed design problem that we study here, but we implement it in Section 5 for comparison with the nonaveraged procedure.
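The exponential weights in (1.2) can be computed directly from the empirical MSE of each dictionary element. The following is a minimal sketch, not the authors' implementation: the function names are ours, each dictionary element is assumed to be given as the list of its values at the design points, and the smallest score is subtracted before exponentiating purely for numerical stability (this does not change the normalized weights).

```python
import math
from typing import List, Optional, Sequence

def empirical_mse(y: Sequence[float], f: Sequence[float]) -> float:
    """Empirical MSE (1/n) * sum_i (Y_i - f(x_i))^2 at the design points."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / len(y)

def exponential_weights(
    y: Sequence[float],
    dictionary: Sequence[Sequence[float]],
    beta: float,
    prior: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weights proportional to prior_j * exp(-n * MSE_hat(f_j) / beta), as in (1.2)."""
    n, M = len(y), len(dictionary)
    prior = [1.0 / M] * M if prior is None else list(prior)  # uniform prior by default
    scores = [n * empirical_mse(y, f) / beta for f in dictionary]
    shift = min(scores)  # stabilizes the exponentials; cancels after normalization
    raw = [p * math.exp(-(s - shift)) for p, s in zip(prior, scores)]
    total = sum(raw)
    return [r / total for r in raw]
```

For example, `exponential_weights([1.0, 0.0, 1.0], [[1, 0, 1], [0, 0, 0], [1, 1, 1]], beta=4.0)` puts the largest weight on the first element, which fits the observations exactly, while a very large `beta` returns weights close to the prior.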

The fixed design Gaussian regression was considered in Dalalyan and Tsybakov (2007, 2008), who proved an oracle inequality of the form (1.1) with optimal remainder term. This result suffers from two deficiencies: first, it can be extended to other types of noise, but not to sub-Gaussian distributions in general. Second, and perhaps most importantly, it holds only in expectation and not with high probability. While this second limitation may have followed from the proof technique, we actually show in Section 2 that it is inherent to exponential weights. Consequently, we say that exponential weights are deviation suboptimal since the expectation of the resulting MSE is of the optimal order, but the deviations around the expectation are not. Note also that the original paper of Dalalyan and Tsybakov (2007) made some boundedness assumption on the distance between functions in the dictionary $\mathcal{H}$ and the regression function $\eta$. This assumption was lifted in their subsequent paper [Dalalyan and Tsybakov (2008)]. In this paper, we make no such assumption except for the lower bound, which, of course, makes our result even stronger.

For regression with random design, Audibert (2008) also observed that various progressive mixture rules are deviation suboptimal. In the same paper, he addressed this issue by proposing the STAR algorithm, which is optimal both in expectation and in deviation under the uniform prior and, remarkably, does not require any parameter tuning as opposed to progressive mixture rules. Also for random design, Lecué and Mendelson (2009), followed by Gaïffas and Lecué (2011), proposed deviation optimal methods based on the same sample splitting idea. However, sample splitting methods do not carry over to fixed design. Subsequently, Rigollet (2012) proposed a new estimator, similar to the one studied in the rest of the paper, that enjoys the same theoretical properties as the STAR algorithm but for fixed design regression. However, while it is the solution of a convex optimization problem, Rigollet's method comes without implementation. Finally, a first implementation of a greedy algorithm that enjoys optimal deviation was proposed in Dai and Zhang (2011). Our subsequent results extend both the results of Rigollet (2012) and Dai and Zhang (2011) in various directions.

In Section 2 of the present paper, we study the deviation suboptimality of two commonly used aggregate estimators: the aggregate by exponential weights and the aggregate by projection. Then, in Section 3, we extend the original method of Rigollet (2012) in several directions. First and foremost, our extension allows us to put a prior weight on each element of the dictionary. These prior weights appear explicitly in the oracle inequalities that are derived in Section 3. Both the method of Rigollet (2012) and ours are solutions of convex optimization problems. In Section 4, we propose efficient greedy model averaging (GMA) procedures that approximately solve the newly proposed Q-aggregation formulations. It is shown that GMA can produce sparse model aggregates that achieve optimal deviation bounds. The performance of different model selection and aggregation estimators is compared in Section 5.

Notation. For any vector $v$, we denote by $v_j$ its $j$th coordinate. Moreover, for any functions $f, g : \mathcal{X} \to \mathbb{R}$, we define the pseudo-norm

$$\|f\|^2 = \frac{1}{n} \sum_{i=1}^{n} f(x_i)^2$$

and the associated inner product

$$\langle f, g \rangle = \frac{1}{n} \sum_{i=1}^{n} f(x_i) g(x_i).$$

Also, we define the function $Y : \mathcal{X} \to \mathbb{R}$ to be any function such that $Y(x_i) = Y_i$. Observe that with the above notation, we have

$$\widehat{\mathrm{MSE}}(f) = \|Y - f\|^2, \qquad \mathrm{MSE}(f) = \|\eta - f\|^2.$$

Finally, for any $p \ge 1$, we denote by $|\cdot|_p$ the $\ell_p$ norm.

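2. Deviation suboptimality of commonly used estimators. It is well known [see, e.g., Rigollet and Tsybakov (2012)] that the exponential weights $\lambda^{\mathrm{EXP}}$ defined in (1.2) are the solution of the following minimization problem:

$$\lambda^{\mathrm{EXP}} \in \arg\min_{\lambda \in \Lambda} \Big\{ \sum_{j=1}^{M} \lambda_j \widehat{\mathrm{MSE}}(f_j) + \frac{\beta}{n} \sum_{j=1}^{M} \lambda_j \log\Big(\frac{\lambda_j}{\pi_j}\Big) \Big\}. \tag{2.1}$$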
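For completeness, here is a sketch of why the exponential weights solve (2.1). This is the standard Lagrangian argument, added for convenience and not reproduced from the references; it assumes $\pi_j > 0$ for all $j$. Write $a_j = \widehat{\mathrm{MSE}}(f_j)$ and introduce a multiplier $\mu$ for the constraint $\sum_j \lambda_j = 1$ (the positivity constraints turn out to be inactive):

```latex
\begin{aligned}
L(\lambda,\mu) &= \sum_{j=1}^{M} \lambda_j a_j
  + \frac{\beta}{n}\sum_{j=1}^{M} \lambda_j \log\frac{\lambda_j}{\pi_j}
  + \mu\Big(\sum_{j=1}^{M}\lambda_j - 1\Big),\\
0 = \frac{\partial L}{\partial \lambda_j}
  &= a_j + \frac{\beta}{n}\Big(\log\frac{\lambda_j}{\pi_j} + 1\Big) + \mu
\;\Longrightarrow\;
\lambda_j = \pi_j \exp\Big(-\frac{n a_j}{\beta}\Big)\,
            \exp\Big(-1-\frac{n\mu}{\beta}\Big).
\end{aligned}
```

The last factor does not depend on $j$, so normalizing gives $\lambda_j \propto \pi_j \exp(-n a_j/\beta)$. Since the objective is strictly convex on the simplex, this stationary point is the unique minimizer.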

It was shown in Dalalyan and Tsybakov (2007, 2008) that for $\beta \ge 4\sigma^2$, it holds

$$\mathbb{E}\,\mathrm{MSE}(f_{\lambda^{\mathrm{EXP}}}) \le \min_{j=1,\dots,M} \Big\{ \mathrm{MSE}(f_j) + \frac{\beta}{n} \log\Big(\frac{1}{\pi_j}\Big) \Big\}. \tag{2.2}$$

The proof of this result relies heavily on the fact that the oracle inequality holds in expectation, and whether the result also holds with high probability arises as a natural question. While the paper of Audibert (2008) does not cover the fixed design Gaussian regression framework of our paper and concerns exponential weights with an extra averaging step, it contributed to the common belief that exponential weights would be suboptimal in deviation. In particular, Lecué and Mendelson (2012) derived lower bounds for the performance of exponential weights in expectation when $\beta$ is chosen below a certain constant threshold in the case of regression with random design. Moreover, they proved deviation sub-optimality of exponential weights when $\beta$ is less than $\sqrt{n}/(\log n)$. However, these lower bounds rely heavily on the fact that the design is random and do not extend to the fixed design case. In particular, their construction uses $Y = 0$, which is clearly an easy problem in the fixed design case. Proposition 2.1 states precisely that exponential weights are deviation suboptimal if $\beta$ is chosen small enough, and in particular if $\beta$ is any constant with respect to $M$ and $n$.

Another natural solution to the MS aggregation problem is to take the vector of weights $\lambda^{\mathrm{PROJ}}$ defined by

$$\lambda^{\mathrm{PROJ}} \in \arg\min_{\lambda \in \Lambda} \widehat{\mathrm{MSE}}(f_\lambda). \tag{2.3}$$

We call $\lambda^{\mathrm{PROJ}}$ the vector of projection weights since the aggregate estimator $f_{\lambda^{\mathrm{PROJ}}}$ is the projection of $Y$ onto the convex hull of the $f_j$'s.

It has been established that this choice is near-optimal for the more difficult problem of convex aggregation with fixed design [see Juditsky and Nemirovski (2000), Nemirovski (2000), Rigollet (2012)], where the goal is to mimic the best convex combination of the $f_j$'s as opposed to simply mimicking the best of them. More precisely, it follows from Theorem 3.5 in Rigollet (2012) that

$$\mathbb{E}\,\mathrm{MSE}(f_{\lambda^{\mathrm{PROJ}}}) \le \min_{\lambda \in \Lambda} \mathrm{MSE}(f_\lambda) + 2\sigma\sqrt{\frac{\log M}{n}} \le \min_{j=1,\dots,M} \mathrm{MSE}(f_j) + 2\sigma\sqrt{\frac{\log M}{n}},$$

and a similar oracle inequality also holds with high probability. The second inequality is very coarse, and it is therefore natural to study whether a finer analysis of this estimator would yield an optimal oracle inequality for the aggregate $f_{\lambda^{\mathrm{PROJ}}}$ both in expectation and with high probability. This question was investigated by Lecué and Mendelson (2009), who proved that $f_{\lambda^{\mathrm{PROJ}}}$ cannot satisfy an oracle inequality of the form (1.1) with high probability and with a remainder term $\Delta_{n,M}(\sigma^2)$ of order smaller than $n^{-1/2}$. Their proof, however, heavily uses the fact that the design is random, and we extend it to the fixed design case in Proposition 2.2 below.

For both aggregates considered below, we use the following notation. For each $j = 1, \dots, M$, we identify the function $f_j$ on $\{x_1, \dots, x_n\}$ with a vector $\mathfrak{f}_j = (f_j(x_1), \dots, f_j(x_n))^\top \in \mathbb{R}^n$, where we systematically use the gothic font to identify such vectors throughout the rest of the section. Moreover, for any vector of weights $\lambda \in \mathbb{R}^M$, we write $\mathfrak{f}_\lambda = (f_\lambda(x_1), \dots, f_\lambda(x_n))^\top$.

2.1. Aggregate by exponential weights. Consider the following dictionary $\mathcal{H}$. Assume that $M, n \ge 3$. Let $e^{(1)} = (1, 0, \dots, 0)^\top \in \mathbb{R}^n$ and $e^{(2)} = (0, 1, 0, \dots, 0)^\top \in \mathbb{R}^n$ be the first two vectors of the canonical basis of $\mathbb{R}^n$. Moreover, let $e^{(3)}, \dots, e^{(M)} \in \mathbb{R}^n$ be $M - 2$ unit vectors of $\mathbb{R}^n$ that are orthogonal to both $e^{(1)}$ and $e^{(2)}$. Let $\mathfrak{f}_1, \dots, \mathfrak{f}_M$ be such that

$$\mathfrak{f}_1 = \sigma\sqrt{n}\, e^{(1)}, \qquad \mathfrak{f}_2 = \sigma(1 + \sqrt{n})\, e^{(2)},$$

and for any $j = 3, \dots, M$, $\mathfrak{f}_j$ is defined by

$$\mathfrak{f}_j = \mathfrak{f}_2 + \sigma\alpha_j e^{(j)},$$

where $\alpha_3, \dots, \alpha_M \ge 0$ are tuning parameters to be chosen later. Moreover, take the regression function $\eta = 0$ so that $\mathrm{MSE}(f_1) < \mathrm{MSE}(f_j)$ for any $j \ge 2$. Observe that $\|f_j\| \ge \sigma$ so that the following lower bounds cannot be interpreted as artifacts of scaling the signal-to-noise ratio.

Assume that $M \ge 4$ and $n \ge 3$. We call low temperatures the parameters $\beta > 0$ such that

$$\beta \le \frac{2\sigma^2\sqrt{n}}{\log(8\sqrt{n})}. \tag{2.4}$$

In particular, the exponential weights employed in the literature on MS aggregation use the low temperature $\beta = 4\sigma^2$; see, for example, (2.2) above.

Proposition 2.1. Fix $M \ge 4$, $n \ge 3$ and assume that the noise random variables $\xi_1, \dots, \xi_n$ are i.i.d. $\mathcal{N}(0, \sigma^2)$. Let $\eta$ and $\mathcal{H}$ be defined as above. Then, the aggregate estimator $f_{\lambda^{\mathrm{EXP}}}$ with exponential weights $\lambda^{\mathrm{EXP}}$ given by (1.2) satisfies

$$\mathrm{MSE}(f_{\lambda^{\mathrm{EXP}}}) \ge \min_{j=1,\dots,M} \mathrm{MSE}(f_j) + \frac{\sigma^2}{4\sqrt{n}},$$

with probability at least 0.07 at low temperatures, for any $\alpha_3, \dots, \alpha_M \ge 0$. Moreover, if $M \ge 8\sqrt{n}$ and for any $j \ge 3$, we have

$$2\sqrt{2\log(100M)} \le \alpha_j \le n^{1/4}, \tag{2.5}$$

then the same result holds at any temperature, with probability at least 0.06.

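Proof. Note first that by homogeneity, one may assume that $\sigma = 1$. Moreover, write for simplicity $\lambda = \lambda^{\mathrm{EXP}}$. If we assume $\lambda_1 \le 1/2$, we obtain

$$\begin{aligned}
|\mathfrak{f}_\lambda|_2^2 - |\mathfrak{f}_1|_2^2
&\ge |\lambda_1 \mathfrak{f}_1 + (1 - \lambda_1)\mathfrak{f}_2|_2^2 - |\mathfrak{f}_1|_2^2 \\
&= (1 - \lambda_1)^2 |\mathfrak{f}_2|_2^2 - (1 - \lambda_1^2)|\mathfrak{f}_1|_2^2 \\
&\ge 2(1 - \lambda_1)^2 \sqrt{n} + \big[(1 - \lambda_1)^2 - (1 - \lambda_1^2)\big] n \\
&\ge \sqrt{n}/2 - 2\lambda_1 n.
\end{aligned} \tag{2.6}$$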

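We first treat the low temperature case where $\beta$ is chosen as in (2.4). Define the event

$$E = \big\{ n\,\widehat{\mathrm{MSE}}(f_2) + 2\sqrt{n} \le n\,\widehat{\mathrm{MSE}}(f_1) \big\},$$

and observe that $\eta = 0$ gives

$$E = \big\{ 2\langle \mathfrak{f}_2 - \mathfrak{f}_1, \xi \rangle_2 \ge |\mathfrak{f}_2|_2^2 - |\mathfrak{f}_1|_2^2 + 2\sqrt{n} \big\}. \tag{2.7}$$

On the one hand, we have $|\mathfrak{f}_2|_2^2 - |\mathfrak{f}_1|_2^2 = 1 + 2\sqrt{n}$, and on the other hand

$$|\mathfrak{f}_2 - \mathfrak{f}_1|_2^2 = |\mathfrak{f}_2|_2^2 + |\mathfrak{f}_1|_2^2 = 2n + 2\sqrt{n} + 1 \ge \tfrac{1}{8}(1 + 4\sqrt{n})^2.$$

Thus, we have

$$\mathbb{P}(E) \ge \mathbb{P}\big( 2\langle \mathfrak{f}_2 - \mathfrak{f}_1, \xi \rangle_2 \ge 2\sqrt{2}\,|\mathfrak{f}_2 - \mathfrak{f}_1|_2 \big) = \mathbb{P}(Z \ge \sqrt{2}) \ge 0.07, \tag{2.8}$$

where $Z \sim \mathcal{N}(0, 1)$. In view of (1.2), on the event $E$, we have

$$\lambda_1 \le \lambda_2 e^{-2\sqrt{n}/\beta} \le \frac{1}{8\sqrt{n}} \le \frac{1}{2}$$

for low temperature $\beta$ chosen as in (2.4). Together with (2.6), it yields

$$|\mathfrak{f}_\lambda|_2^2 - |\mathfrak{f}_1|_2^2 \ge \frac{\sqrt{n}}{4}.$$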

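We now turn to the case of potentially high temperatures. Actually, the following proof holds for any temperature $\beta$ as long as the $\alpha_j$'s are chosen small enough. In this case, we can expect the $M$ exponential weights to take comparable values. To that end, define for each $j = 2, \dots, M$ the event

$$F_j = \big\{ \widehat{\mathrm{MSE}}(f_j) < \widehat{\mathrm{MSE}}(f_1) \big\}.$$

Define $F = \bigcap_{j=2}^{M} F_j$, and denote by $F_j^c$ the complement of $F_j$. Recall that $|\mathfrak{f}_j|_2^2 = |\mathfrak{f}_2|_2^2 + \alpha_j^2$ so that

$$\begin{aligned}
F_j^c &= \big\{ 2\langle \mathfrak{f}_j - \mathfrak{f}_1, \xi \rangle_2 \le |\mathfrak{f}_j|_2^2 - |\mathfrak{f}_1|_2^2 \big\} \\
&= \big\{ 2\langle \mathfrak{f}_2 - \mathfrak{f}_1, \xi \rangle_2 + 2\langle \mathfrak{f}_j - \mathfrak{f}_2, \xi \rangle_2 \le |\mathfrak{f}_2|_2^2 - |\mathfrak{f}_1|_2^2 + \alpha_j^2 \big\} \\
&\subset E^c \cup G_j,
\end{aligned}$$

where $E$ is defined in (2.7), and $G_j$ is defined as

$$G_j = \big\{ 2\langle \mathfrak{f}_j - \mathfrak{f}_2, \xi \rangle_2 \le \alpha_j^2 - 2\sqrt{n} \big\}.$$

In view of (2.5), we have

$$\mathbb{P}(G_j) \le \mathbb{P}\big( 2\langle \mathfrak{f}_j - \mathfrak{f}_2, \xi \rangle_2 \le -\alpha_j^2 \big) \le \mathbb{P}\big( Z \ge \sqrt{2\log(100M)} \big) \le \frac{1}{100M}.$$

Therefore,

$$\mathbb{P}(F^c) \le \mathbb{P}(E^c) + \sum_{j=2}^{M} \mathbb{P}(G_j) \le 0.93 + 0.01 = 0.94.$$

Note now that on the event $F$, for any $j = 2, \dots, M$, we have $\lambda_j \ge \lambda_1$ so that $\lambda_1 \le 1/M \le 1/2$. Together with (2.6), it yields

$$|\mathfrak{f}_\lambda|_2^2 - |\mathfrak{f}_1|_2^2 \ge \frac{\sqrt{n}}{2} - \frac{2n}{M} \ge \frac{\sqrt{n}}{4},$$

where, in the last inequality, we used the fact that $M \ge 8\sqrt{n}$. □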
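The low temperature part of Proposition 2.1 can be checked by simulation. The sketch below is only an illustration under our own simplifying assumptions: $\sigma = 1$, uniform prior, and $\alpha_j = 0$ for $j \ge 3$ (so that $\mathfrak{f}_3 = \dots = \mathfrak{f}_M = \mathfrak{f}_2$, which the proposition allows). The function names and default values are hypothetical. With $n = 100$, the temperature $\beta = 4$ is low in the sense of (2.4), since $2\sqrt{n}/\log(8\sqrt{n}) \approx 4.56$.

```python
import math
import random

def exp_weights_excess(n: int, M: int, beta: float, rng: random.Random) -> float:
    """One draw of MSE(f_lambda) - MSE(f_1) for the dictionary of Section 2.1.

    Assumptions of this sketch: sigma = 1, eta = 0, uniform prior,
    and alpha_j = 0 for j >= 3, so that f_3 = ... = f_M = f_2.
    """
    root_n = math.sqrt(n)
    xi = [rng.gauss(0.0, 1.0) for _ in range(n)]  # Y = eta + xi = xi
    f1 = [root_n if i == 0 else 0.0 for i in range(n)]
    f2 = [1.0 + root_n if i == 1 else 0.0 for i in range(n)]
    dictionary = [f1] + [f2] * (M - 1)
    # n * MSE_hat(f_j) = |Y - f_j|_2^2, divided by the temperature beta
    scores = [sum((y - v) ** 2 for y, v in zip(xi, f)) / beta for f in dictionary]
    shift = min(scores)
    raw = [math.exp(-(s - shift)) for s in scores]
    total = sum(raw)
    lam = [r / total for r in raw]
    agg = [sum(l * f[i] for l, f in zip(lam, dictionary)) for i in range(n)]
    # Since eta = 0, MSE(f) = |f|_2^2 / n.
    return (sum(a * a for a in agg) - sum(v * v for v in f1)) / n

def deviation_frequency(n: int = 100, M: int = 4, beta: float = 4.0,
                        trials: int = 2000, seed: int = 0) -> float:
    """Fraction of draws whose excess risk is at least 1 / (4 sqrt(n))."""
    rng = random.Random(seed)
    threshold = 1.0 / (4.0 * math.sqrt(n))
    hits = sum(exp_weights_excess(n, M, beta, rng) >= threshold for _ in range(trials))
    return hits / trials
```

Calling `deviation_frequency()` estimates the probability that the excess risk exceeds $\sigma^2/(4\sqrt{n})$; the proposition guarantees that this probability is at least 0.07 in this regime.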

2.2. Aggregate by projection. Our lower bound for the aggregate by projection relies on a different construction of the dictionary. Let $m$ be the smallest integer that satisfies $m^2 \ge 4n/13$ and let $n, M$ be large enough to ensure that $m \ge 16$, $M - 1 \ge 2m$. Let $e^{(1)}, \dots, e^{(m)} \in \mathbb{R}^n$ be the first $m$ vectors of the canonical basis of $\mathbb{R}^n$. For any $j = 1, \dots, M$, the $\mathfrak{f}_j$'s are defined as

$\mathfrak{f}_j = \sigma\sqrt{n}\, e^{(j)}$ if $1 \le j \le m$,
$\mathfrak{f}_j = -\sigma\sqrt{n}\, e^{(j-m)}$ if $m + 1 \le j \le 2m$,
$\mathfrak{f}_j = 0$ if $j = 2m + 1$,
if $j > 2m + 1$.

Moreover, define $\eta = 0$ so that $0 = \mathrm{MSE}(f_{2m+1}) \le \mathrm{MSE}(f_j)$ for all $j \le M$.

Proposition 2.2. Fix $n \ge 416$, $M \ge \sqrt{n}$, and assume that the noise random variables $\xi_1, \dots, \xi_n$ are i.i.d. $\mathcal{N}(0, \sigma^2)$. Let $\eta$ and $\mathcal{H}$ be defined as above. Then the projection aggregate estimator $f_{\lambda^{\mathrm{PROJ}}}$ with weights $\lambda^{\mathrm{PROJ}}$ defined in (2.3) is such that

$$\mathrm{MSE}(f_{\lambda^{\mathrm{PROJ}}}) \ge \min_{j=1,\dots,M} \mathrm{MSE}(f_j) + \frac{\sigma^2}{\sqrt{13n}},$$

with probability larger than 1/4. Moreover, the above lower bound holds with arbitrarily large probability if $n$ is chosen large enough.

Proof. Note first that by homogeneity, one may assume that $\sigma = 1$. Next, observe that $\mathfrak{f}_{\lambda^{\mathrm{PROJ}}} = (P_m\xi, 0, \dots, 0)^\top \in \mathbb{R}^n$, where $P_m\xi \in \mathbb{R}^m$ is the projection of $\xi = (\xi_1, \dots, \xi_m)^\top$ onto $B_1^m(\sqrt{n})$, the $\ell_1$-ball of $\mathbb{R}^m$ with radius $\sqrt{n}$.

Let $E$ denote the event on which $|\xi|_1 \le \sqrt{n}$ and observe that, on this event, we have $P_m\xi = \xi$. It yields

$$n\,\mathrm{MSE}(f_{\lambda^{\mathrm{PROJ}}}) = \sum_{j=1}^{m} \xi_j^2 = |\xi|_2^2.$$

Let now $F$ denote the event on which $|\xi|_2^2 \ge m/2$, and note that on $E \cap F$, it holds

$$\mathrm{MSE}(f_{\lambda^{\mathrm{PROJ}}}) \ge \frac{m}{2n} \ge \sqrt{\frac{1}{13n}}.$$

To conclude our proof, it remains to bound from below the probability of $E \cap F$. The bounds below follow from the fact that $|\xi|_2^2$ follows a chi-squared distribution with $m$ degrees of freedom. We begin with the event $E$. Using Hölder's inequality, we have

$$\mathbb{P}(E^c) \le \mathbb{P}\Big( |\xi|_2^2 > \frac{n}{m} \Big) = \mathbb{P}\Big( |\xi|_2^2 - m > \frac{n}{m} - m \Big).$$

Next, using the fact that $m^2 \le 8n/13$ together with Laurent and Massart [(2000), Lemma 1], we get

$$\mathbb{P}(E^c) \le \mathbb{P}\Big( |\xi|_2^2 - m > \frac{5m}{8} \Big) \le e^{-m/16}.$$

Moreover, using Laurent and Massart [(2000), Lemma 1], we also get that

$$\mathbb{P}(F^c) = \mathbb{P}\Big( |\xi|_2^2 - m < -\frac{m}{2} \Big) \le e^{-m/16}.$$

Therefore, since $n \ge 416$ implies $m \ge 16$, we get

$$\mathbb{P}(E \cap F) \ge 1 - \mathbb{P}(E^c) - \mathbb{P}(F^c) \ge 1 - 2e^{-m/16} \ge 1 - 2/e \ge 1/4. \qquad \square$$

Note that we employed a different dictionary for each of the aggregates. 
Therefore, it may be the case that choosing the right aggregate for the 
right dictionary gives the correct deviation bounds. In the next section, we 
propose a new aggregate estimator called Q-aggregate, that automatically 
adjusts the aggregate to the dictionary at hand. 

3. Deviation optimal model selection by Q-aggregation. According to (2.1), the weight vector $\lambda^{\mathrm{EXP}}$ considered in the previous section minimizes a penalized linear interpolation of the function $\lambda \mapsto \widehat{\mathrm{MSE}}(f_\lambda)$. The major novelty of the method introduced in Rigollet (2012) compared to exponential weighting is to add a quadratic term to this linear interpolation. We introduce a family of estimators that extends the original estimator of Rigollet in two directions: (i) it allows for a prior weighting of the functions in the dictionary, and (ii) it allows putting a different weight on each of the components of the fitting criterion via the tuning parameter $\nu$ introduced below.

Let $\pi \in \Lambda$ be a given prior, and define the following entropic penalty:

$$\mathcal{K}_\rho(\lambda, \pi) = \sum_{j=1}^{M} \lambda_j \log\Big( \frac{\rho(\lambda_j)}{\pi_j} \Big),$$

where $\rho$ is a real-valued function on $[0, 1]$ that satisfies $\rho(t) \ge t$ and is such that $t \mapsto t\log\rho(t)$ is convex. We are particularly interested in the choices $\rho \equiv 1$, the constant function equal to 1, which leads to a penalty that is linear in $\lambda$, and $\rho(t) = t$, the identity function of $[0, 1]$, which leads to the well-known Kullback-Leibler penalty employed in exponential weights.

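Given a dictionary $\mathcal{H}$ and observations $Y_1, \dots, Y_n$, let $Q : \Lambda \to \mathbb{R}$ be the function defined by

$$Q(\lambda) = (1 - \nu)\widehat{\mathrm{MSE}}(f_\lambda) + \nu \sum_{j=1}^{M} \lambda_j \widehat{\mathrm{MSE}}(f_j) + \frac{\beta}{n} \mathcal{K}_\rho(\lambda, \pi), \tag{3.1}$$

where $\nu \in [0, 1]$. Let $\hat\lambda \in \Lambda$ be such that

$$\hat\lambda \in \arg\min_{\lambda \in \Lambda} Q(\lambda). \tag{3.2}$$

We call $f_{\hat\lambda}$ the Q-aggregate estimator. Note that on the one hand, if $\nu = 1$ and $\rho(t) = t$, then $\hat\lambda = \lambda^{\mathrm{EXP}}$, the exponential weights defined in (1.2). On the other hand, choosing $\nu = 0$, $\rho(t) = 1$ and $\pi$ to be the uniform prior yields $\hat\lambda = \lambda^{\mathrm{PROJ}}$, the projection weights.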
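To make the three components of the criterion $Q$ concrete, the following sketch evaluates it for a given weight vector. It is an illustration only: the function names are ours, dictionary elements are assumed to be given by their values at the design points, $\rho$ is passed as a callable defaulting to the constant function 1, and zero weights are skipped in the penalty (convention $0 \log(\cdot) = 0$).

```python
import math
from typing import Callable, List, Sequence

def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
    """Normalized squared distance ||u - v||^2 = (1/n) * sum_i (u_i - v_i)^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def aggregate(lam: Sequence[float], dictionary: Sequence[Sequence[float]]) -> List[float]:
    """Values of f_lambda = sum_j lambda_j f_j at the design points."""
    n = len(dictionary[0])
    return [sum(l * f[i] for l, f in zip(lam, dictionary)) for i in range(n)]

def q_criterion(lam, y, dictionary, prior, nu: float, beta: float,
                rho: Callable[[float], float] = lambda t: 1.0) -> float:
    """Q(lambda): quadratic fit + linear interpolation + entropic penalty."""
    n = len(y)
    fit = sq_dist(y, aggregate(lam, dictionary))                    # MSE_hat(f_lambda)
    avg = sum(l * sq_dist(y, f) for l, f in zip(lam, dictionary))   # sum_j lambda_j MSE_hat(f_j)
    pen = sum(l * math.log(rho(l) / p) for l, p in zip(lam, prior) if l > 0)
    return (1 - nu) * fit + nu * avg + beta / n * pen
```

Setting `nu=1.0` with `rho=lambda t: t` gives the objective in (2.1) minimized by the exponential weights, while `nu=0.0` with the default `rho` and a uniform prior gives the projection objective (2.3) up to an additive constant.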

The next theorem shows that the Q-aggregate estimator is optimal both in expectation and in deviation. It holds under less restrictive conditions on the noise random variables $\xi_1, \dots, \xi_n$. We say that the random vector $\xi = (\xi_1, \dots, \xi_n)^\top$ is sub-Gaussian with variance proxy $\sigma^2 > 0$ if, for all $t \in \mathbb{R}^n$, its moment generating function satisfies

$$\mathbb{E}\big[\exp(t^\top \xi)\big] \le \exp\Big( \frac{\sigma^2 |t|_2^2}{2} \Big).$$

Note that if $\xi \sim \mathcal{N}_n(0, \Sigma)$, then $\xi$ is sub-Gaussian with variance proxy given by $\sigma^2 = \|\Sigma\|_{\mathrm{op}}$, where $\|\Sigma\|_{\mathrm{op}}$ denotes the largest eigenvalue of the covariance matrix $\Sigma$.

Let $P$ be defined on the simplex $\Lambda$ by

$$P(\lambda) = (1 - \nu)\mathrm{MSE}(f_\lambda) + \nu \sum_{j=1}^{M} \lambda_j \mathrm{MSE}(f_j).$$

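Theorem 3.1. Fix $\nu \in (0, 1)$ and $\pi \in \Lambda$. Moreover, assume that the noise random variables $\xi_1, \dots, \xi_n$ are independent and sub-Gaussian with variance proxy $\sigma^2$. Then for any $\beta \ge \frac{2\sigma^2}{\min(\nu, 1 - \nu)}$ and any $\delta \in (0, 1)$, the Q-aggregate estimator $f_{\hat\lambda}$ satisfies

$$\mathrm{MSE}(f_{\hat\lambda}) \le \min_{\lambda \in \Lambda} \Big[ P(\lambda) + \frac{\beta}{n} \mathcal{K}_\rho(\lambda, \pi) \Big] + \frac{\beta}{n} \log(1/\delta)$$

with probability $1 - \delta$. Moreover,

$$\mathbb{E}\,\mathrm{MSE}(f_{\hat\lambda}) \le \min_{\lambda \in \Lambda} \Big[ P(\lambda) + \frac{\beta}{n} \mathcal{K}_\rho(\lambda, \pi) \Big].$$

Theorem 3.1 follows directly from Theorem 4.1 below, so we prove only the latter in Appendix A.1.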

Our theorem implies that the Q-aggregate can compete with an arbitrary $f_\lambda$ in the convex hull with $\lambda \in \Lambda$. However, we are mainly interested in MS aggregation, where $\lambda$ is at a vertex of the simplex $\Lambda$. With $\nu \in (0, 1)$, the theorem implies that the Q-aggregate estimator is deviation optimal, unlike the aggregate with exponential weights. This is explicitly stated in the following corollary, which shows that our estimator solves optimally the problem of MS aggregation. Its proof follows by simply restricting the infimum over $\Lambda$ to the minimum over its vertices in Theorem 3.1. Nonetheless, it is worth pointing out that our analysis focuses on deviation bounds, and it does not allow us to recover (2.2) for the aggregate with exponential weights when $\nu = 1$.

Corollary 3.1. Under the assumptions of Theorem 3.1, the Q-aggregate estimator $f_{\hat\lambda}$ satisfies

$$\mathrm{MSE}(f_{\hat\lambda}) \le \min_{j=1,\dots,M} \Big\{ \mathrm{MSE}(f_j) + \frac{\beta}{n} \log\Big( \frac{\rho(1)}{\pi_j \delta} \Big) \Big\}$$

with probability $1 - \delta$. Moreover,

$$\mathbb{E}\,\mathrm{MSE}(f_{\hat\lambda}) \le \min_{j=1,\dots,M} \Big\{ \mathrm{MSE}(f_j) + \frac{\beta}{n} \log\Big( \frac{\rho(1)}{\pi_j} \Big) \Big\}.$$

Remark 3.1. If we set $\rho(t) = 1$ and employ the uniform prior $\pi_j = 1/M$, $j = 1, \dots, M$, then the optimization of the criterion $Q$ is independent of $\beta$. In this case, we may simply set $\nu = 1/2$, and the Q-aggregate estimator becomes parameter free, and we recover the original aggregate of Rigollet (2012).
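As a concrete instance of Corollary 3.1, consider the following specialization, which is our own and assumes $\rho(1) = 1$ (true for both $\rho \equiv 1$ and $\rho(t) = t$). Take the uniform prior $\pi_j = 1/M$, $\nu = 1/2$ and the smallest temperature allowed by Theorem 3.1, namely $\beta = 2\sigma^2/\min(\nu, 1 - \nu) = 4\sigma^2$. The deviation bound then reads:

```latex
\mathrm{MSE}(f_{\hat\lambda})
\le \min_{j=1,\dots,M} \mathrm{MSE}(f_j)
  + \frac{4\sigma^2}{n}\log\Big(\frac{M}{\delta}\Big)
\qquad \text{with probability } 1-\delta .
```

The remainder is of the optimal order $\sigma^2(\log M)/n$, now in deviation and not only in expectation.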

4. Algorithms. In the previous section, we introduced and analyzed the Q-aggregate estimator. It can be easily seen that if $M$ is moderate, then it can be computed efficiently since it requires solving the convex optimization problem (3.2). The purpose of this section is to propose greedy model averaging (GMA) procedures that can approximately solve the Q-aggregation formulation (3.2). Moreover, GMA leads to sparse estimators (i.e., the resulting estimators only aggregate a small number of dictionary functions) that achieve the optimal deviation bounds. These algorithms are thus appealing for their simplicity and statistical interpretability.

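4.1. Approximate Q-aggregation. Most numerical optimization algorithms do not find the exact minimum of the objective function $Q$, but only approximate solutions. We introduce two algorithms that minimize $Q$ approximately, with a very specific error term for the optimization task. It relies on the following quantity. Given a dictionary $\mathcal{H}$, for any $\lambda \in \Lambda$, let $V(\lambda)$ denote its variance on $\mathcal{H}$, defined by

$$V(\lambda) = \sum_{j=1}^{M} \lambda_j \|f_j - f_\lambda\|^2.$$

For given $\varepsilon_V, \varepsilon \ge 0$, we call $f_{\lambda_\varepsilon}$ an $(\varepsilon_V, \varepsilon)$-approximate Q-aggregate if the vector of weights $\lambda_\varepsilon \in \Lambda$ is such that

$$Q(\lambda_\varepsilon) \le \min_{\lambda \in \Lambda} \big\{ Q(\lambda) + \varepsilon_V V(\lambda) + \varepsilon \big\}. \tag{4.1}$$

Before going into the detailed description of the algorithms, we state a generalization of Theorem 3.1 that is valid not only for exact minimizers of $Q$ but also for approximate minimizers. Hereafter, we use the convention $0/0 = 0$.

Theorem 4.1. Let $\varepsilon, \varepsilon_V, \nu \ge 0$ be such that $\nu + \varepsilon_V \le 1$ and fix $\pi \in \Lambda$. Moreover, assume that the noise random variables $\xi_1, \dots, \xi_n$ are independent sub-Gaussians with variance proxy $\sigma^2$. Fix any $\theta \in (\varepsilon_V/(\nu + \varepsilon_V), 1]$, and choose $\beta > 0$ such that

$$\beta \ge 2\sigma^2 \max\Big\{ \frac{1}{\nu - \varepsilon_V(1 - \theta)/\theta},\ \frac{1}{(1 - \theta)(1 - \nu - \varepsilon_V)} \Big\}. \tag{4.2}$$

Then for any $\delta \in (0, 1)$, any $(\varepsilon_V, \varepsilon)$-approximate Q-aggregate estimator $f_{\lambda_\varepsilon}$ satisfies

$$\mathrm{MSE}(f_{\lambda_\varepsilon}) \le \min_{\lambda \in \Lambda} \Big[ P(\lambda) + \varepsilon_V V(\lambda) + \frac{\varepsilon}{\theta} + \frac{\beta}{n} \mathcal{K}_\rho(\lambda, \pi) \Big] + \frac{\beta}{n} \log(1/\delta), \tag{4.3}$$

with probability $1 - \delta$. Moreover,

$$\mathbb{E}\,\mathrm{MSE}(f_{\lambda_\varepsilon}) \le \min_{\lambda \in \Lambda} \Big[ P(\lambda) + \varepsilon_V V(\lambda) + \frac{\varepsilon}{\theta} + \frac{\beta}{n} \mathcal{K}_\rho(\lambda, \pi) \Big]. \tag{4.4}$$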

Remark 4.1. If $\varepsilon_V = 0$, then (4.2) reduces to $\beta \ge 2\sigma^2/\min(\nu, (1 - \theta)(1 - \nu))$. Thus if $\nu < 1/2$, we can take $\theta = 1 - \nu/(1 - \nu)$ and $\beta \ge 2\sigma^2/\nu$. If $\nu \ge 1/2$, then for any $\theta \in (0, 1]$, we have $\min(\nu, (1 - \theta)(1 - \nu)) = (1 - \theta)(1 - \nu)$. Furthermore, if $\varepsilon = 0$, then in the case $\nu \ge 1/2$ we can let $\theta \to 0$ and obtain Theorem 3.1.
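The choice of $\theta$ made in Remark 4.1 for $\nu < 1/2$ can be verified in one line; this check is added for convenience and uses only the quantities defined above:

```latex
\theta = 1-\frac{\nu}{1-\nu}
\;\Longrightarrow\;
(1-\theta)(1-\nu) = \frac{\nu}{1-\nu}\,(1-\nu) = \nu,
\qquad\text{so}\qquad
\min\big(\nu,\,(1-\theta)(1-\nu)\big) = \nu
\quad\text{and}\quad
\beta \ge \frac{2\sigma^2}{\nu}.
```

Note that $\nu < 1/2$ gives $\nu/(1 - \nu) < 1$, so this $\theta$ is indeed positive and at most 1.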

Remark 4.2. Theorem 4.1 is related to PAC-Bayes-type inequalities that also employ entropy regularization. In particular, the proof involves an interpolated risk with variance correction, and such techniques have also appeared in earlier papers such as Audibert (2004) in a different context.

Clearly the smaller the $\varepsilon_V$ and $\varepsilon$, the better the oracle inequality. Nevertheless, in the canonical example where $\pi$ is the uniform prior, it is sufficient to have $\varepsilon$ uniformly bounded by $C(\log M)/n$ for some $C > 0$ in order to maintain a statistical accuracy of the same order as that of the true Q-aggregate. However, if an estimator has error term with $\varepsilon = 0$ and a constant $\varepsilon_V > 0$, then it achieves a statistical accuracy of the same order as that of the true Q-aggregate because the variance term $\varepsilon_V V$ vanishes at the vertices of the simplex $\Lambda$. This is the main reason to differentiate $\varepsilon_V$ and $\varepsilon$ in (4.1). As we will show later on, specially designed greedy algorithms can lead to an error term with $\varepsilon = 0$, and thus such greedy algorithms achieve optimal deviation bounds for MS aggregation.
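The fact that the variance term vanishes at the vertices of the simplex is easy to check numerically. The helper below is a small sketch with a name of our choosing; it assumes each dictionary element is given by its values at the design points and uses the $(1/n)$-normalized pseudo-norm.

```python
from typing import Sequence

def variance_on_dictionary(lam: Sequence[float],
                           dictionary: Sequence[Sequence[float]]) -> float:
    """V(lambda) = sum_j lambda_j * ||f_j - f_lambda||^2 (normalized by 1/n)."""
    n = len(dictionary[0])
    # Aggregate f_lambda evaluated at the design points.
    f_lam = [sum(l * f[i] for l, f in zip(lam, dictionary)) for i in range(n)]
    return sum(
        l * sum((fi - gi) ** 2 for fi, gi in zip(f, f_lam)) / n
        for l, f in zip(lam, dictionary)
    )
```

For the two-element dictionary `[[1.0, 0.0], [0.0, 1.0]]`, the value is zero at either vertex and 0.25 at the midpoint `[0.5, 0.5]`.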

4.2. Greedy Q-aggregation. Optimizing convex functions over convex sets is the bread and butter of modern statistical computing, with many algorithms ranging from gradient descent to interior point (IP) methods [see, e.g., Boyd and Vandenberghe (2004) for a recent overview]. For simple constraint sets such as the simplex $\Lambda$ considered here, so-called proximal methods [see, e.g., Beck and Teboulle (2009)] have shown very promising performance, especially when $M$ becomes large. However, the most efficient of these methods (IP and proximal methods) do not output a sparse solution in general.

Algorithm 1 Greedy model averaging (GMA-1 and GMA-1+)
Input: Noisy observation $Y$, dictionary $\mathcal{H} = \{f_1, \dots, f_M\}$, prior $\pi \in \Lambda$, parameters $\nu$, $\beta$.
Output: Aggregate estimator $f_{\lambda^{(k)}}$.
Let $\lambda^{(0)} = 0$, $f_{\lambda^{(0)}} = 0$.
for $k = 1, 2, \dots$ do
    Set $\alpha_k = 2/(k + 1)$
    $J^{(k)} = \arg\min_j \big(\nabla Q(\lambda^{(k-1)})\big)_j$
    option-1 (GMA-1): $\lambda^{(k)} = \lambda^{(k-1)} + \alpha_k\big(e^{(J^{(k)})} - \lambda^{(k-1)}\big)$
    option-2 (GMA-1+): $\lambda^{(k)} = \arg\min_{\lambda \in \Lambda} Q(\lambda)$ s.t. $\lambda_j = 0$ for $j \notin \{J^{(1)}, \dots, J^{(k)}\}$
end for

In the sequel, we focus on simple greedy model averaging algorithms (i.e., each iteration takes the form of a greedy selection of a function in the dictionary) that enjoy the following property. After $k$ iterations, these algorithms return a vector $\lambda^{(k)}$ such that (i) $\lambda^{(k)}$ has at most $k$ nonzero coefficients, and (ii) $f_{\lambda^{(k)}}$ is an approximate Q-aggregate estimator, where the quality of the approximation will be made explicit. Specifically, appropriately designed greedy algorithms can give $\varepsilon = 0$ in (4.1) for all $k \ge 2$, and thus achieve optimal deviation bounds using only $k \ge 2$ dictionary functions.

Minimizing a quadratic objective over the simplex $\Lambda$ is a common problem in statistics and optimization. We focus on greedy algorithms introduced into the statistical literature by Jones (1992). In optimization, greedy algorithms over the simplex $\Lambda$ are known as Frank-Wolfe-type (or reduced gradient) methods. Their name refers to the original paper of Frank and Wolfe (1956).

We consider a few variants of greedy algorithms described in Algorithms 1 and 2. In these algorithms, $e^{(j)}$ denotes the $j$th vector of the canonical basis of $\mathbb{R}^M$. Both algorithms can be seen as greedy algorithms that add at most one function from the dictionary at each iteration. This feature is attractive as it outputs a $k$-sparse solution that depends on at most $k$ functions from the dictionary after $k$ iterations. Each algorithm contains two variants: GMA-0 and GMA-0+ in Algorithm 2, and GMA-1 and GMA-1+ in Algorithm 1. At the same sparsity level $k$, the GMA-0+ (resp., GMA-1+) variant can further reduce the approximation error of GMA-0 (resp., GMA-1) in (4.1) via a more aggressive optimization step. This kind of additional optimization is referred to as a fully-corrective step [Shalev-Shwartz, Srebro and Zhang (2010)], which is known to improve performance in practice. The difference between Algorithms 1 and 2 is that the former uses first order information, namely the gradient $\nabla Q$, to pick the best coordinate $J^{(k)}$ (which is the standard Frank-Wolfe procedure in the greedy algorithm literature), while the latter uses only zero order information, namely, the coordinate that minimizes the objective value $Q(\cdot)$ (which is relatively uncommon in the greedy algorithm literature). A similar algorithm with the purpose of solving MS aggregation has appeared in Dai and Zhang (2011).

Algorithm 2 Greedy model averaging (GMA-0 and GMA-0+)
Input: Noisy observation $Y$, dictionary $\mathcal{H} = \{f_1, \dots, f_M\}$, prior $\pi \in \Lambda$, parameters $\nu$, $\beta$.
Output: Aggregate estimator $f_{\lambda^{(k)}}$.
Let $\lambda^{(0)} = 0$, $f_{\lambda^{(0)}} = 0$.
for $k = 1, 2, \dots$ do
    Set $\alpha_k = 2/(k + 1)$
    $J^{(k)} = \arg\min_j Q\big(\lambda^{(k-1)} + \alpha_k(e^{(j)} - \lambda^{(k-1)})\big)$
    option-1 (GMA-0): $\lambda^{(k)} = \lambda^{(k-1)} + \alpha_k\big(e^{(J^{(k)})} - \lambda^{(k-1)}\big)$
    option-2 (GMA-0+): $\lambda^{(k)} = \arg\min_{\lambda \in \Lambda} Q(\lambda)$ s.t. $\lambda_j = 0$ for $j \notin \{J^{(1)}, \dots, J^{(k)}\}$
end for
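The option-1 updates of Algorithms 1 and 2 can be sketched in a few lines. This is not the authors' code: it assumes $\rho \equiv 1$ (linear penalty), represents each dictionary element by its values at the design points, takes the step size $\alpha_k = 2/(k+1)$ (the usual Frank-Wolfe choice, assumed here), and omits the fully-corrective variants GMA-0+ and GMA-1+. The names `q_value`, `q_gradient` and `gma` are hypothetical.

```python
import math

def _sq(u, v):
    """Normalized squared distance (1/n) * sum_i (u_i - v_i)^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def _agg(lam, D):
    """Values of f_lambda at the design points."""
    return [sum(l * f[i] for l, f in zip(lam, D)) for i in range(len(D[0]))]

def q_value(lam, y, D, prior, nu, beta):
    """Q(lambda) with rho = 1: quadratic fit plus a term linear in lambda."""
    n = len(y)
    lin = sum(l * (nu * _sq(y, f) + beta / n * math.log(1.0 / p))
              for l, f, p in zip(lam, D, prior))
    return (1 - nu) * _sq(y, _agg(lam, D)) + lin

def q_gradient(lam, y, D, prior, nu, beta):
    """Gradient of Q at lambda (rho = 1)."""
    n = len(y)
    resid = [yi - ai for yi, ai in zip(y, _agg(lam, D))]
    return [-2 * (1 - nu) * sum(r * fi for r, fi in zip(resid, f)) / n
            + nu * _sq(y, f) + beta / n * math.log(1.0 / p)
            for f, p in zip(D, prior)]

def gma(y, D, prior, nu, beta, steps, first_order=True):
    """GMA-1 (first_order=True) or GMA-0 (first_order=False), option-1 updates."""
    M = len(D)
    lam = [0.0] * M
    for k in range(1, steps + 1):
        alpha = 2.0 / (k + 1)

        def move(j):
            # lambda + alpha * (e_j - lambda)
            return [(1 - alpha) * l + (alpha if i == j else 0.0)
                    for i, l in enumerate(lam)]

        if first_order:
            grad = q_gradient(lam, y, D, prior, nu, beta)
            J = min(range(M), key=lambda j: grad[j])
        else:
            J = min(range(M), key=lambda j: q_value(move(j), y, D, prior, nu, beta))
        lam = move(J)
    return lam
```

Because $\alpha_1 = 1$, the first iterate is always a vertex of the simplex, and each later step adds at most one new nonzero coordinate, so `gma(..., steps=k)` returns a weight vector with at most `k` nonzero entries.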

Note that both algorithms give approximate solutions $\lambda^{(k)}$ that
converge to the optimal solution of (3.2); that is, when $k \to \infty$, we
have $\varepsilon_V \to 0$ and $\varepsilon \to 0$ in (4.1). The classical
Frank-Wolfe style analysis of greedy algorithms leads to the same
convergence rate for both approaches with error term of $\varepsilon_V = 0$
and $\varepsilon > 0$ in (4.1). The result is presented below in
Proposition 4.1. Moreover, we present a new analysis that differentiates
these two algorithms. Specifically we obtain a convergence result in
Theorem 4.2 below with error term of $\varepsilon = 0$ in (4.1) for
Algorithm 2 when $k \ge 2$ (but we are unable to prove the same result for
Algorithm 1). The importance of achieving error with $\varepsilon = 0$ is
that for $k \ge 2$, Algorithm 2 can produce a $k$-sparse approximate
solution $\lambda^{(k)}$ of (3.2) that achieves optimal deviation.

The following proposition follows from the standard analysis in Frank
and Wolfe (1956), Jones (1992). It shows that the estimators
$\lambda^{(k)}$ from Algorithms 1 and 2 converge to the optimal solution of
the Q-aggregation formulation (3.2). Therefore when $k \to \infty$,
$\lambda^{(k)}$ achieves the optimal deviation bound. However, a
disadvantage of the bound is that the result does not imply optimal
deviation bounds for $\lambda^{(k)}$ when $k$ is small (e.g., when $k = 2$).

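Proposition 4.1. Assume that the dictionary $\mathcal{H}$ is such that
$\max_j \|f_j\| \le L$. Fix $\nu \in (0, 1/2)$ and $\pi \in \Lambda$.
Moreover, assume that the noise random variables $\xi_1, \ldots, \xi_n$ are
independent and sub-Gaussian with variance proxy $\sigma^2$. Take $\rho = 1$
and

$$\beta \ge \frac{2\sigma^2}{\nu}.$$

Then, for any $k \ge 1$, the aggregate estimator $\mathsf{f}_{\lambda^{(k)}}$,
where $\lambda^{(k)}$ is output by GMA-1 or GMA-0 (or GMA-1+ or GMA-0+)
after $k$ steps, satisfies

$$\mathrm{MSE}(\mathsf{f}_{\lambda^{(k)}}) \le \min_j \Big\{ \mathrm{MSE}(f_j) + \frac{\beta}{n} \log\Big(\frac{1}{\pi_j \delta}\Big) \Big\} + \frac{16(1-\nu)^2 L^2}{1-2\nu} \cdot \frac{1}{k+3}$$

with probability $1 - \delta$. Moreover,

$$\mathbb{E}\,\mathrm{MSE}(\mathsf{f}_{\lambda^{(k)}}) \le \min_j \Big\{ \mathrm{MSE}(f_j) + \frac{\beta}{n} \log\Big(\frac{1}{\pi_j}\Big) \Big\} + \frac{16(1-\nu)^2 L^2}{1-2\nu} \cdot \frac{1}{k+3}.$$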

Remark 4.3. For simplicity, we consider the case of $\nu < 1/2$, although
similar bounds can be obtained with $\nu \ge 1/2$.

Remark 4.4. The result of Proposition 4.1 follows from the classical
greedy algorithm analysis in Frank and Wolfe (1956), Jones (1992), Barron
(1993). In particular, the result for $\lambda^{(k)}$ output by GMA-1 is
well known in the literature; see also Clarkson (2008), Jaggi (2011). For
completeness, we include the proof in Appendix A.3, especially since the
greedy step in GMA-0 (and GMA-0+) is relatively uncommon.

Remark 4.5. It is known that the fully-corrective variants GMA-0+
and GMA-1+ generally achieve better performance than their
partially-corrective counterparts GMA-0 and GMA-1 at the same sparsity
level $k$. Although our analysis does not show their advantages, faster
convergence rates can be obtained for fully-corrective algorithms under
additional assumptions [Shalev-Shwartz, Srebro and Zhang (2010)]. Since the
issue is not essential for our paper, we only illustrate the benefit of
fully-corrective updates by experiments.

Remark 4.6. It follows from the proof of Proposition 4.1 that GMA-0
can be used to optimize the function $Q$ over the simplex $\Lambda$.
Therefore, we can use it as a subroutine for option-2 in the description of
Algorithms 2 and 1. More precisely, the following bound holds:

$$Q(\lambda^{(k)}) \le \min_{\lambda \in \Lambda} Q(\lambda) + \frac{16(1-\nu)L^2}{k+3}.$$

For the approximation error $\frac{16(1-\nu)^2 L^2}{1-2\nu} \cdot \frac{1}{k+3}$
to be of the same order as the estimation error, one may choose $k$ such
that

$$k \ge \frac{16(1-\nu)^2 L^2 n}{\beta (1-2\nu) \log(1/\pi_{\max})},$$

where $\pi_{\max} = \max_j \pi_j$. In particular, if $\pi$ is the uniform
prior, then $\mathsf{f}_{\lambda^{(k)}}$ solves the problem of MS
aggregation optimally after

$$\frac{16(1-\nu)^2 L^2 n}{\beta (1-2\nu) \log(M)}$$

iterations.
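
To make the iteration count of Remark 4.6 concrete, the following Python sketch evaluates the bound $k \ge 16(1-\nu)^2 L^2 n / (\beta(1-2\nu)\log(1/\pi_{\max}))$ for given parameters. The function name `min_iterations` and the numerical values in the usage note are hypothetical and chosen only for illustration; the bound on $L$ in particular is an assumption, not a quantity reported in the paper.

```python
import math


def min_iterations(n, L, nu, beta, pi_max):
    """Smallest integer k satisfying the lower bound of Remark 4.6.

    Requires 0 < nu < 1/2 (as in Proposition 4.1) and 0 < pi_max < 1.
    """
    if not 0.0 < nu < 0.5:
        raise ValueError("nu must lie in (0, 1/2)")
    if not 0.0 < pi_max < 1.0:
        raise ValueError("pi_max must lie in (0, 1)")
    bound = 16.0 * (1.0 - nu) ** 2 * L ** 2 * n
    bound /= beta * (1.0 - 2.0 * nu) * (-math.log(pi_max))
    return math.ceil(bound)
```

For instance, with $n = 50$, a uniform prior over $M = 200$ functions, and the hypothetical values $L = 1$, $\nu = 0.25$, $\beta = 32$, the call `min_iterations(50, 1.0, 0.25, 32.0, 1 / 200)` returns 6.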

Note that the above proposition requires the somewhat unpleasant
assumption that the functions in the dictionary are uniformly bounded in
the $\|\cdot\|$ norm. Indeed, this assumption has not appeared so far and
is therefore not natural in this problem.

More importantly, the bound only leads to optimal deviation for large $k$
of the order $n/\log(M)$. The cause of this unpleasant issue is that the
error term has $\varepsilon \ne 0$ and $\varepsilon_V = 0$ in (4.1). In
order to obtain an optimal deviation bound, we have to derive an error
bound of the form (4.1) with either $\varepsilon = O(\log(M)/n)$, or with
$\varepsilon = 0$ and $\varepsilon_V \ne 0$. In the latter case, we allow
$\varepsilon_V$ to be relatively large, which means that we do not have to
solve (3.2) accurately. The following theorem shows that such an error
bound (with $\varepsilon = 0$) can be achieved via GMA-0 (and GMA-0+); in
addition, this result removes the assumption on the boundedness of the
dictionary functions.

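Theorem 4.2. Fix $\nu \in (0, 1)$, $k \ge 2$ and $\pi \in \Lambda$.
Moreover, assume that the noise random variables $\xi_1, \ldots, \xi_n$ are
independent and sub-Gaussian with variance proxy $\sigma^2$. Take
$\rho = 1$ and

$$\beta \ge 2\sigma^2 \inf_{\theta \in (0,1]} \max\Big\{ \frac{1}{\nu - (4(1-\nu)(1-\theta))/((k+3)\theta)},\ \frac{1}{(1-\theta)(1-\nu)(1-4/(k+3))} \Big\}.$$

Then the aggregate estimator $\mathsf{f}_{\lambda^{(k)}}$, where
$\lambda^{(k)}$ is output by GMA-0 (or GMA-0+) after $k$ steps, satisfies

$$\mathrm{MSE}(\mathsf{f}_{\lambda^{(k)}}) \le \min_j \Big\{ \mathrm{MSE}(f_j) + \frac{\beta}{n} \log\Big(\frac{1}{\pi_j \delta}\Big) \Big\},$$

with probability $1 - \delta$. Moreover,

$$\mathbb{E}\,\mathrm{MSE}(\mathsf{f}_{\lambda^{(k)}}) \le \min_j \Big\{ \mathrm{MSE}(f_j) + \frac{\beta}{n} \log\Big(\frac{1}{\pi_j}\Big) \Big\}.$$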

Remark 4.7. The theorem implies deviation bounds of the optimal
order for all $k \ge 2$, and the constant $\beta$ decreases to
$2\sigma^2/\min(\nu, 1-\nu)$ as in Theorem 3.1 when $k \to \infty$. Such
results indicate that the choice of $\nu$ is not critical and any positive
constant leads to the same optimal bound. However, we can optimize the
constant by choosing $\nu = 1/2$ and we use this value in the simulations.

Moreover, a careful inspection of the proof indicates that
$\mathsf{f}_{\lambda^{(k)}}$, where $\lambda^{(k)}$ is output by GMA-0 (or
GMA-0+) after $k$ steps, is a $(\varepsilon_V, 0)$-approximate
Q-aggregate estimator with $\varepsilon_V = 4(1-\nu)/(k+3)$. As a result,
the condition $\nu + \varepsilon_V < 1$ of Theorem 4.1 requires that
$k \ge 2$.

To get a better quantitative idea of the result, we illustrate the
particular choice $\nu = 1/2$. In this case, it can be easily shown that
the optimal $\theta$ is given by $\theta^* = 2/(\sqrt{k+3} + 2)$.
Therefore, in this case, one may take

$$\beta = \frac{4\sigma^2}{1 - 2/\sqrt{k+3}}.$$

In particular, for $k = 2$, it is sufficient to take
$\beta = 20\sigma^2(1 + 2/\sqrt{5}) > 37\sigma^2$. Although it achieves
the optimal rate for MS aggregation, the large constant implies that it is
still beneficial to run the algorithm for more than two iterations. This is
confirmed by our experiments.
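
The constant for $\nu = 1/2$ can be checked numerically. The Python sketch below assumes that the condition on $\beta$ in Theorem 4.2 reads $\beta \ge 2\sigma^2 \inf_\theta \max\{1/(\nu - \varepsilon_V(1-\theta)/\theta),\ 1/((1-\theta)(1-\nu-\varepsilon_V))\}$ with $\varepsilon_V = 4(1-\nu)/(k+3)$, as suggested by the quantities $G_1$ and $G_2$ in Appendix A.1. It compares a grid search over $\theta$ with the closed form $4/(1 - 2/\sqrt{k+3})$, both expressed in units of $\sigma^2$. The function names are hypothetical.

```python
import math


def beta_closed_form(k):
    """beta / sigma^2 for nu = 1/2, i.e. 4 / (1 - 2 / sqrt(k + 3))."""
    return 4.0 / (1.0 - 2.0 / math.sqrt(k + 3))


def beta_grid_search(k, nu=0.5, grid=200000):
    """Approximate 2 * inf_theta max{1/d1, 1/d2} on a uniform grid in (0, 1)."""
    eps_v = 4.0 * (1.0 - nu) / (k + 3)
    best = math.inf
    for i in range(1, grid):
        theta = i / grid
        d1 = nu - eps_v * (1.0 - theta) / theta
        d2 = (1.0 - theta) * (1.0 - nu - eps_v)
        if d1 <= 0.0 or d2 <= 0.0:
            continue  # the bound is vacuous for this theta
        best = min(best, 2.0 * max(1.0 / d1, 1.0 / d2))
    return best
```

For $k = 2$ the closed form equals $20 + 8\sqrt{5}$, which is slightly above 37, and the grid search agrees with it up to the grid resolution. Larger $k$ gives smaller constants, approaching 4 as $k$ grows.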

It is worth pointing out that with flat prior, the first stage estimator
$\mathsf{f}_{\lambda^{(1)}} = f_{\hat{j}}$ is simply the empirical risk
minimizer with $\hat{j} \in \arg\min_j \widehat{\mathrm{MSE}}(f_j)$. We
have already pointed out that this estimator achieves sub-optimal deviation
bounds; therefore the requirement of $k \ge 2$ in our analysis is natural.
With $k = 2$, the estimator $\mathsf{f}_{\lambda^{(2)}}$ is related to the
STAR algorithm, which can be regarded as a two-stage greedy algorithm that
minimizes the empirical loss function instead of the Q-aggregation loss
investigated in this paper. This means that we cannot directly generalize
the STAR algorithm to more than two stages since it converges to
$\mathsf{f}_{\lambda^{\mathrm{PROJ}}}$, which is known to be suboptimal for
MS aggregation.

Notice that Theorem 4.2 has consequences on optimization problems beyond
the scope of this paper. Indeed, we constructed a greedy algorithm for
which the approximation error at each iteration is expressed as a function
(here $\varepsilon_V V$) and not simply a constant as usual. This
construction allowed us to derive a convergence rate that achieves optimal
deviation bounds for greedy model averaging, and to avoid stringent and
unnatural conditions on the boundedness of the problem. One of the key
aspects of the function $\varepsilon_V V$ is that it vanishes on the set of
vertices. We believe that this technique may find applications in other
optimization problems.

5. Numerical experiments. Although optimal deviation bounds are obtained
for greedy Q-aggregation with $k \ge 2$, our analysis suggests that the
performance can improve when $k$ increases (due to reduced constants). The
purpose of this section is to illustrate this behavior using numerical
examples. We focus on the average performance of different algorithms and
configurations.

We identify a function $f$ with a vector
$(f(x_1), \ldots, f(x_n))^\top \in \mathbb{R}^n$. Define
$f_1, \ldots, f_M$ so that the $n \times M$ design matrix
$X = [f_1, \ldots, f_M]$ has i.i.d. standard Gaussian entries. Let $I_n$
denote the identity matrix of $\mathbb{R}^n$, and let
$\Delta \sim \mathcal{N}_n(0, I_n)$ be a random vector. The regression
function is defined by $\eta = f_1 + 0.5\Delta$. Note that typically $f_1$
will be the closest function to $\eta$ but not necessarily. The noise
vector $\xi \sim \mathcal{N}_n(0, 4I_n)$ is drawn independently of $X$.
We define the oracle model (OM) $f_{j^*}$, where
$j^* = \arg\min_j \mathrm{MSE}(f_j)$. The performance difference between an
estimator $\hat{\eta}$ and the oracle model $f_{j^*}$ is measured by the
regret defined as

$$R(\hat{\eta}) = \mathrm{MSE}(\hat{\eta}) - \mathrm{MSE}(f_{j^*}).$$
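
The data-generating mechanism and the regret just described can be sketched as follows. This is a minimal Python illustration, not the code used for the reported experiments: the function names `simulate`, `mse` and `regret` are hypothetical, the default values $n = 50$, $M = 200$ and $\sigma = 2$ are the ones used later in this section, and the MSE is taken to be the mean squared error at the design points.

```python
import random


def simulate(n=50, M=200, sigma=2.0, seed=0):
    """Draw a dictionary, a regression function and noisy observations.

    F[j] is the vector of f_j at the design points (i.i.d. standard Gaussian),
    eta = f_1 + 0.5 * Delta, and Y = eta + xi with xi ~ N(0, sigma^2 I_n).
    """
    rng = random.Random(seed)
    F = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(M)]
    delta = [rng.gauss(0, 1) for _ in range(n)]
    eta = [F[0][i] + 0.5 * delta[i] for i in range(n)]
    Y = [eta[i] + sigma * rng.gauss(0, 1) for i in range(n)]
    return F, eta, Y


def mse(f, eta):
    """Mean squared error between f and eta at the design points."""
    return sum((a - b) ** 2 for a, b in zip(f, eta)) / len(eta)


def regret(estimate, F, eta):
    """Regret of an estimator relative to the oracle model in the dictionary."""
    oracle = min(mse(f, eta) for f in F)
    return mse(estimate, eta) - oracle
```

By construction the regret of the oracle model is zero and the regret of any single dictionary element is nonnegative, while a convex combination may in principle have negative regret.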

We run the GMA-0, GMA-0+, GMA-1 and GMA-1+ algorithms for $k$ iterations
up to $k = 40$. The temperature $\beta$ of the exponential weights (EXP)
is tuned via 10-fold cross-validation. The projection aggregation (PROJ)
estimator is obtained from GMA-0 with $\nu = 0$ for 250 iterations
following Remark 4.6. The fully-corrective optimization steps in GMA-0+ and
GMA-1+ are implemented using GMA-0 and GMA-1 restricted to the support
$\{J^{(1)}, \ldots, J^{(k)}\}$ at each step $k$. The purpose is to achieve
better performance at the same sparsity level $k$.

Since the target is $\eta = f_1 + 0.5\Delta$, and $f_1$ and $\Delta$ are
random Gaussian vectors, the oracle model is likely $f_1$ (but it may not
be $f_1$ due to the misspecification vector $\Delta$). The noise level
$\sigma = 2$ is relatively large, which implies a situation where the best
convex aggregation does not outperform the oracle model. This is the
scenario considered in this paper. For simplicity, all algorithms use a
flat prior $\pi_j = 1/M$ for all $j$.

The experiment is performed with the parameters $n = 50$, $M = 200$, and
$\sigma = 2$, and repeated for 500 replications. In order to avoid clutter,
the detailed regrets of the different algorithms are given in Table 2 in
Appendix B. Table 1 is a simplified comparison of commonly used estimators
(EXP and PROJ as well as STAR) with GMA-0, GMA-0+, GMA-1 and GMA-1+ and
$\nu = 0.5$. The regret is reported using the "mean ± standard deviation"
format.

Table 1
Performance comparison

            STAR            EXP             PROJ
            0.43 ± 0.41     0.386 ± 0.47    0.407 ± 0.28

$\nu$ = 0.5   k = 1           k = 2           k = 5           k = 20          k = 40
GMA-0       0.508 ± 0.76    0.42 ± 0.53     0.358 ± 0.42    0.336 ± 0.38    0.332 ± 0.37
GMA-0+      0.508 ± 0.76    0.366 ± 0.5     0.341 ± 0.4     0.336 ± 0.38    0.336 ± 0.38
GMA-1       0.54 ± 0.79     0.683 ± 0.44    0.391 ± 0.38    0.342 ± 0.36    0.334 ± 0.37
GMA-1+      0.54 ± 0.79     0.381 ± 0.46    0.338 ± 0.38    0.336 ± 0.38    0.336 ± 0.38

Fig. 1. Regrets $R(\mathsf{f}_{\lambda^{(k)}})$ versus iterations $k$ for $\nu = 0.5, 0.1, 0$, under 500 replications.



The results in Table 1 indicate that for GMA-0 (or GMA-0+), from $k = 1$
(corresponding to MS aggregation) to $k = 2$, there is a significant
reduction of error. The performance of GMA-0 (or GMA-0+) with $k = 2$ is
comparable to that of the STAR algorithm. This is not surprising as STAR
can be regarded as the stage-2 greedy model averaging estimator based on
empirical risk minimization. We also observe that the error keeps
decreasing (but at a slower pace) when $k > 2$, which is consistent with
Theorem 4.2. It means that in order to achieve good performance, it is
necessary to use more stages than $k = 2$ [although this does not change
the $O(1/n)$ rate for the regret, it can significantly reduce the
constant]. It becomes better than EXP when $k$ is as small as 5, which
still gives a relatively sparse averaged model.

Figure 1 compares the MSE performance of different values of $\nu$ for the
greedy algorithms considered in the paper. It shows that for the scenario
we are interested in (i.e., where the noise is relatively large, and the
best single model is nearly as good as the best convex hull combination),
it is beneficial to choose $\nu = 0.5$. Note that the greedy procedure
with $\nu = 0$ converges to the convex hull projection aggregate estimator
$\mathsf{f}_{\lambda^{\mathrm{PROJ}}}$ which we have shown to be
sub-optimal for MS aggregation. Therefore these results are consistent
with our theoretical analysis, and illustrate the importance of
Q-aggregation with $\nu > 0$ for MS aggregation.

Figure 2 compares the MSE of different greedy procedures at $\nu = 0.5$
(additional comparisons at $\nu = 0.1$ and $\nu = 0$ can be found in
Figure 3 in Appendix B). It shows that the classical first order greedy
method GMA-1 generally performs worse than GMA-0 for all $k$ and
especially when $k$ is small. This is consistent with our theoretical
analysis since Theorem 4.2 only applies to GMA-0. The experiments show
that the fully-corrective variants GMA-0+ and GMA-1+ can potentially give
more accurate results than GMA-0 and GMA-1 at the same sparsity level $k$.

APPENDIX A: PROOFS 

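A.1. Proof of Theorem 4.1. Let $\hat\lambda$ be such that

$$Q(\hat\lambda) \le \min_{\lambda \in \Lambda} \{ Q(\lambda) + \varepsilon_V V(\lambda) + \varepsilon \}.$$

Fix $\theta \in (0, 1)$ and for any $\lambda \in \Lambda$, define
$\lambda_\theta \in \Lambda$ by
$\lambda_\theta = (1-\theta)\hat\lambda + \theta\lambda$. Note that

$$P(\hat\lambda) - P(\lambda_\theta) = (1-\nu)[\mathrm{MSE}(\mathsf{f}_{\hat\lambda}) - \mathrm{MSE}(\mathsf{f}_{\lambda_\theta})] + \nu\theta \sum_{j=1}^M (\hat\lambda_j - \lambda_j)\,\mathrm{MSE}(f_j).$$

Moreover, it is not hard to verify that

$$\mathrm{MSE}(\mathsf{f}_{\hat\lambda}) - \mathrm{MSE}(\mathsf{f}_{\lambda_\theta}) = \theta\,\mathrm{MSE}(\mathsf{f}_{\hat\lambda}) - \theta\,\mathrm{MSE}(\mathsf{f}_\lambda) + \theta(1-\theta)\|\mathsf{f}_{\hat\lambda} - \mathsf{f}_\lambda\|^2.$$

The above two displays and the definition of $P(\lambda)$ yield

$$P(\hat\lambda) - P(\lambda_\theta) = \theta[P(\hat\lambda) - P(\lambda)] + \theta(1-\theta)(1-\nu)\|\mathsf{f}_{\hat\lambda} - \mathsf{f}_\lambda\|^2. \tag{A.1}$$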

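Moreover, by the definition of $\hat\lambda$, we have

$$Q(\hat\lambda) \le Q(\lambda_\theta) + \varepsilon_V V(\lambda_\theta) + \varepsilon.$$

By replacing $Q(\hat\lambda)$ and $Q(\lambda_\theta)$ with the expansion

$$Q(\lambda) = P(\lambda) + \langle \xi, \xi \rangle - 2\langle \xi, \mathsf{f}_\lambda - \eta \rangle + \frac{\beta}{n}\mathcal{K}_\rho(\lambda, \pi),$$

where $\xi = Y - \eta$, we obtain

$$P(\hat\lambda) - P(\lambda_\theta) \le 2\langle \xi, \mathsf{f}_{\hat\lambda} - \mathsf{f}_{\lambda_\theta} \rangle + \frac{\beta}{n}\mathcal{K}_\rho(\lambda_\theta, \pi) - \frac{\beta}{n}\mathcal{K}_\rho(\hat\lambda, \pi) + \varepsilon_V V(\lambda_\theta) + \varepsilon$$

$$\le 2\langle \xi, \mathsf{f}_{\hat\lambda} - \mathsf{f}_{\lambda_\theta} \rangle + \frac{\theta\beta}{n}\mathcal{K}_\rho(\lambda, \pi) - \frac{\theta\beta}{n}\mathcal{K}_\rho(\hat\lambda, \pi) + \varepsilon_V V(\lambda_\theta) + \varepsilon,$$

where in the second inequality, we applied Jensen's inequality with
$\lambda_\theta = (1-\theta)\hat\lambda + \theta\lambda$ to the convex
function $\lambda \mapsto \mathcal{K}_\rho(\lambda, \pi)$. Plugging (A.1)
into this and dividing by $\theta$, we get

$$P(\hat\lambda) - P(\lambda) \le R_n(\mathsf{f}_{\hat\lambda}) - (1-\theta)(1-\nu)\|\mathsf{f}_{\hat\lambda} - \mathsf{f}_\lambda\|^2 + \frac{\beta}{n}\mathcal{K}_\rho(\lambda, \pi) + \frac{\varepsilon_V}{\theta}V(\lambda_\theta) + \frac{\varepsilon}{\theta}, \tag{A.2}$$

where, using the fact that
$\mathsf{f}_{\hat\lambda} - \mathsf{f}_{\lambda_\theta} = \theta(\mathsf{f}_{\hat\lambda} - \mathsf{f}_\lambda)$,
we can take

$$R_n(\mathsf{f}_{\hat\lambda}) = 2\langle \xi, \mathsf{f}_{\hat\lambda} - \mathsf{f}_\lambda \rangle - \frac{\beta}{n}\mathcal{K}_\rho(\hat\lambda, \pi).$$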

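The following lemma allows us to control $R_n(\mathsf{f}_{\hat\lambda})$
both in expectation and with high probability.

Lemma A.1. Let the noise vector $\xi = (\xi_1, \ldots, \xi_n)$ be
sub-Gaussian with variance proxy $\sigma^2$. Then, for any $\beta > 0$,
$\lambda \in \mathbb{R}^M$, we have

$$\mathbb{E}\exp\Big( \frac{n}{\beta} R_n(\mathsf{f}_{\hat\lambda}) - \frac{2\sigma^2 n}{\beta^2} \sum_{j=1}^M \hat\lambda_j T_j(\lambda) \Big) \le 1,$$

where $T_j(\lambda) = \|f_j - \mathsf{f}_\lambda\|^2$.

Proof. Fix $\lambda \in \mathbb{R}^M$. Using successively Jensen's
inequality and the assumption that $t \le \rho(t)$ yields

$$\mathbb{E}\exp\Big( \sum_{j=1}^M \hat\lambda_j \Big[ \frac{2n}{\beta}\langle \xi, f_j - \mathsf{f}_\lambda \rangle - \log\Big(\frac{\rho(\hat\lambda_j)}{\pi_j}\Big) - \frac{2\sigma^2 n}{\beta^2} T_j(\lambda) \Big] \Big)$$

$$\le \mathbb{E} \sum_{j=1}^M \hat\lambda_j \exp\Big( \frac{2n}{\beta}\langle \xi, f_j - \mathsf{f}_\lambda \rangle - \log\Big(\frac{\rho(\hat\lambda_j)}{\pi_j}\Big) - \frac{2\sigma^2 n}{\beta^2} T_j(\lambda) \Big)$$

$$\le \sum_{j=1}^M \pi_j\, \mathbb{E}\exp\Big( \frac{2n}{\beta}\langle \xi, f_j - \mathsf{f}_\lambda \rangle - \frac{2\sigma^2 n}{\beta^2} T_j(\lambda) \Big).$$

Observe now that since $\xi$ is sub-Gaussian, we have

$$\mathbb{E}\exp\Big( \frac{2n}{\beta}\langle \xi, f_j - \mathsf{f}_\lambda \rangle \Big) \le \exp\Big( \frac{2\sigma^2 n}{\beta^2}\|\mathsf{f}_\lambda - f_j\|^2 \Big) = \exp\Big( \frac{2\sigma^2 n}{\beta^2} T_j(\lambda) \Big).$$

This completes the proof of our lemma. $\square$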

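To prove the first result of Theorem 4.1, note that Lemma A.1 together
with a Chernoff bound yield that for any $\delta \in (0, 1)$,

$$R_n(\mathsf{f}_{\hat\lambda}) \le \frac{2\sigma^2}{\beta} \sum_{j=1}^M \hat\lambda_j \|f_j - \mathsf{f}_\lambda\|^2 + \frac{\beta\log(1/\delta)}{n} \tag{A.3}$$

with probability at least $1 - \delta$. By combining (A.2) and (A.3), and
using the definition of $P(\hat\lambda)$, we obtain

$$(1-\nu)\,\mathrm{MSE}(\mathsf{f}_{\hat\lambda}) + \nu \sum_{j=1}^M \hat\lambda_j\,\mathrm{MSE}(f_j) \le P(\lambda) + \frac{2\sigma^2}{\beta} \sum_{j=1}^M \hat\lambda_j \|f_j - \mathsf{f}_\lambda\|^2 + \frac{\beta\log(1/\delta)}{n}$$

$$\qquad - (1-\theta)(1-\nu)\|\mathsf{f}_{\hat\lambda} - \mathsf{f}_\lambda\|^2 + \frac{\beta}{n}\mathcal{K}_\rho(\lambda, \pi) + \frac{\varepsilon_V}{\theta}V(\lambda_\theta) + \frac{\varepsilon}{\theta}. \tag{A.4}$$

The following identities follow directly from algebra:

$$\sum_{j=1}^M \hat\lambda_j\,\mathrm{MSE}(f_j) = \mathrm{MSE}(\mathsf{f}_{\hat\lambda}) + V(\hat\lambda),$$

$$\sum_{j=1}^M \hat\lambda_j \|f_j - \mathsf{f}_\lambda\|^2 = V(\hat\lambda) + \|\mathsf{f}_{\hat\lambda} - \mathsf{f}_\lambda\|^2.$$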

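Together with (A.4), they yield

$$\mathrm{MSE}(\mathsf{f}_{\hat\lambda}) \le P(\lambda) + \Big[\frac{2\sigma^2}{\beta} - \nu\Big] V(\hat\lambda) + \frac{\beta\log(1/\delta)}{n}$$

$$\qquad + \Big[\frac{2\sigma^2}{\beta} - (1-\theta)(1-\nu)\Big] \|\mathsf{f}_{\hat\lambda} - \mathsf{f}_\lambda\|^2 + \frac{\beta}{n}\mathcal{K}_\rho(\lambda, \pi) + \frac{\varepsilon_V}{\theta}V(\lambda_\theta) + \frac{\varepsilon}{\theta}. \tag{A.5}$$

We now use the following identity which again follows directly from
algebra:

$$V(\lambda_\theta) = \theta V(\lambda) + (1-\theta) V(\hat\lambda) + \theta(1-\theta)\|\mathsf{f}_{\hat\lambda} - \mathsf{f}_\lambda\|^2.$$

Together with (A.5), we obtain

$$\mathrm{MSE}(\mathsf{f}_{\hat\lambda}) \le P(\lambda) + G_1 V(\hat\lambda) + G_2 \|\mathsf{f}_{\hat\lambda} - \mathsf{f}_\lambda\|^2 + \frac{\beta}{n}\mathcal{K}_\rho(\lambda, \delta\pi) + \varepsilon_V V(\lambda) + \frac{\varepsilon}{\theta},$$

where $\delta\pi = (\delta\pi_1, \ldots, \delta\pi_M)^\top$,

$$G_1 = \frac{2\sigma^2}{\beta} - \nu + \frac{\varepsilon_V(1-\theta)}{\theta}$$

and

$$G_2 = \frac{2\sigma^2}{\beta} - (1-\theta)(1-\nu) + \varepsilon_V(1-\theta).$$

To complete the proof of (4.3), it is sufficient to note that choosing
$\beta$ as in (4.2) ensures that $G_1 \le 0$ and $G_2 \le 0$.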

Using the convexity inequality $t \le e^t - 1$ for any $t \in \mathbb{R}$,
we find that (A.3) also holds in expectation. The proof of (4.4) is then
concluded in the same way as the proof of (4.3) by making statements in
expectation instead of statements that hold with high probability.
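
The two algebraic identities used in this proof are easy to check numerically. The Python sketch below assumes $V(\lambda) = \sum_j \lambda_j \|f_j - \mathsf{f}_\lambda\|^2$ and the empirical norm normalized by $1/n$; it returns the residuals of the identity $\sum_j \hat\lambda_j \|f_j - \mathsf{f}_\lambda\|^2 = V(\hat\lambda) + \|\mathsf{f}_{\hat\lambda} - \mathsf{f}_\lambda\|^2$ and of the expansion of $V(\lambda_\theta)$ on a random instance. All function names are hypothetical.

```python
import random


def sq_norm(v):
    """Empirical squared norm (1/n) * sum(v_i^2)."""
    return sum(x * x for x in v) / len(v)


def combine(lam, F):
    """Aggregate f_lam = sum_j lam_j f_j evaluated at the design points."""
    return [sum(l * f[i] for l, f in zip(lam, F)) for i in range(len(F[0]))]


def variance_term(lam, F):
    """V(lam) = sum_j lam_j * ||f_j - f_lam||^2."""
    f_lam = combine(lam, F)
    return sum(l * sq_norm([a - b for a, b in zip(f, f_lam)])
               for l, f in zip(lam, F))


def random_simplex(rng, M):
    """A random point of the simplex, obtained by normalizing uniform draws."""
    w = [rng.random() for _ in range(M)]
    total = sum(w)
    return [x / total for x in w]


def identity_residuals(F, lam_hat, lam, theta):
    """Residuals (left side minus right side) of the two identities."""
    f_hat, f_lam = combine(lam_hat, F), combine(lam, F)
    gap = sq_norm([a - b for a, b in zip(f_hat, f_lam)])
    lhs = sum(l * sq_norm([a - b for a, b in zip(f, f_lam)])
              for l, f in zip(lam_hat, F))
    first = lhs - (variance_term(lam_hat, F) + gap)
    lam_theta = [(1 - theta) * a + theta * b for a, b in zip(lam_hat, lam)]
    second = variance_term(lam_theta, F) - (
        theta * variance_term(lam, F)
        + (1 - theta) * variance_term(lam_hat, F)
        + theta * (1 - theta) * gap)
    return first, second
```

Both residuals vanish up to floating-point error for any dictionary and any pair of points of the simplex, and $V$ is zero at every vertex.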

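A.2. Proof of Theorem 4.2. It follows from a Taylor expansion that for
any $\mu, \mu' \in \Lambda$, we have

$$Q(\mu) = Q(\mu') + (\mu - \mu')^\top \nabla Q(\mu') + (1-\nu)\|\mathsf{f}_\mu - \mathsf{f}_{\mu'}\|^2. \tag{A.6}$$

Observe also that for any $\lambda \in \Lambda$, we have (both for GMA-0
and GMA-0+)

$$Q(\lambda^{(k+1)}) \le \sum_{j=1}^M \lambda_j\, Q(\lambda^{(k)} + \alpha_{k+1}(e^{(j)} - \lambda^{(k)})).$$

Expanding each term on the right-hand side using (A.6) with
$\mu = \lambda^{(k)} + \alpha_{k+1}(e^{(j)} - \lambda^{(k)})$ and
$\mu' = \lambda^{(k)}$ yields

$$Q(\lambda^{(k+1)}) \le Q(\lambda^{(k)}) + \alpha_{k+1}^2 (1-\nu) \sum_{j=1}^M \lambda_j \|f_j - \mathsf{f}_{\lambda^{(k)}}\|^2 + \alpha_{k+1}(\lambda - \lambda^{(k)})^\top \nabla Q(\lambda^{(k)}). \tag{A.7}$$

Note that

$$\sum_{j=1}^M \lambda_j \|f_j - \mathsf{f}_{\lambda^{(k)}}\|^2 = \sum_{j=1}^M \lambda_j \|f_j - \mathsf{f}_\lambda\|^2 + \|\mathsf{f}_\lambda - \mathsf{f}_{\lambda^{(k)}}\|^2.$$

Moreover, applying (A.6) with $\mu = \lambda$ and $\mu' = \lambda^{(k)}$
yields

$$\alpha_{k+1}(\lambda - \lambda^{(k)})^\top \nabla Q(\lambda^{(k)}) = \alpha_{k+1}[Q(\lambda) - Q(\lambda^{(k)})] - (1-\nu)\alpha_{k+1}\|\mathsf{f}_{\lambda^{(k)}} - \mathsf{f}_\lambda\|^2.$$

Plugging the above two displays into (A.7) and using
$\alpha_{k+1}^2 - \alpha_{k+1} \le 0$, we get

$$Q(\lambda^{(k+1)}) \le Q(\lambda^{(k)}) + \alpha_{k+1}[Q(\lambda) - Q(\lambda^{(k)})] + \alpha_{k+1}^2 (1-\nu) \sum_{j=1}^M \lambda_j \|f_j - \mathsf{f}_\lambda\|^2.$$

This can be written as

$$\delta_{k+1} \le (1 - \alpha_{k+1})\delta_k + \alpha_{k+1}^2 B, \tag{A.8}$$

where

$$\delta_k = Q(\lambda^{(k)}) - Q(\lambda), \qquad B = (1-\nu) \sum_{j=1}^M \lambda_j \|f_j - \mathsf{f}_\lambda\|^2.$$

To conclude that

$$\delta_k \le \frac{4B}{k+3}, \tag{A.9}$$

we proceed by induction on $k$. It is easy to see from (A.8) with $k = 0$
and $\alpha_1 = 1$ that $\delta_1 \le B$.

Now for $k \ge 1$, bound (A.8) yields

$$\delta_{k+1} \le \Big(1 - \frac{2}{2+k}\Big)\frac{4B}{k+3} + \Big(\frac{2}{2+k}\Big)^2 B = \frac{4(k^2 + 3k + 3)B}{(k+2)^2(k+3)} \le \frac{4B}{k+4},$$

where in the first inequality, we used (A.8) together with the induction
hypothesis (A.9). We have proved that for any $\lambda$, it holds

$$Q(\lambda^{(k)}) \le Q(\lambda) + \frac{4(1-\nu)}{k+3} V(\lambda).$$

To complete the proof, we check that the assumptions of Theorem 4.1
with $\varepsilon_V = 4(1-\nu)/(k+3)$ and $\varepsilon = 0$ are satisfied.
Moreover, using expression (4.2), we get the desired bound on $\beta$. To
conclude, notice that $V(\lambda)$ vanishes at the vertices of the simplex
$\Lambda$.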
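
The induction behind (A.9) can also be checked numerically. The Python sketch below iterates the recursion (A.8) with equality, which is the worst case it allows, using the step size $\alpha_{k+1} = 2/(k+2)$ from the proof, so that the result can be compared with the bound $4B/(k+3)$. The function name `worst_case_gaps` is hypothetical.

```python
def worst_case_gaps(B, steps):
    """Iterate delta_{k+1} = (1 - alpha_{k+1}) * delta_k + alpha_{k+1}^2 * B.

    Returns [delta_1, ..., delta_steps]. The value of delta_0 is irrelevant
    because alpha_1 = 1 removes it from the first update.
    """
    gaps = []
    delta = 0.0
    for k in range(steps):
        alpha = 2.0 / (k + 2)  # alpha_{k+1}
        delta = (1.0 - alpha) * delta + alpha * alpha * B
        gaps.append(delta)
    return gaps
```

For example, with $B = 1$ the first two values are $1$ and $7/9$, which are below $4/4$ and $4/5$ respectively, and the same comparison holds at every later step.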

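A.3. Proof of Proposition 4.1. Similarly to the proof of Theorem 4.2, for
both GMA-1 and GMA-1+, we have

$$Q(\lambda^{(k+1)}) - (1-\nu)\alpha_{k+1}^2 \|f_{J^{(k)}} - \mathsf{f}_{\lambda^{(k)}}\|^2$$

$$= Q(\lambda^{(k)}) + \alpha_{k+1}(e^{(J^{(k)})} - \lambda^{(k)})^\top \nabla Q(\lambda^{(k)})$$

$$\le \sum_{j=1}^M \lambda_j [Q(\lambda^{(k)}) + \alpha_{k+1}(e^{(j)} - \lambda^{(k)})^\top \nabla Q(\lambda^{(k)})]$$

$$= Q(\lambda^{(k)}) + \alpha_{k+1} \nabla Q(\lambda^{(k)})^\top (\lambda - \lambda^{(k)})$$

$$= Q(\lambda^{(k)}) + \alpha_{k+1}[Q(\lambda) - Q(\lambda^{(k)}) - (1-\nu)\|\mathsf{f}_\lambda - \mathsf{f}_{\lambda^{(k)}}\|^2].$$

Using $\|f_{J^{(k)}} - \mathsf{f}_{\lambda^{(k)}}\|^2 \le 4L^2$, we obtain

$$\delta_{k+1} \le (1 - \alpha_{k+1})\delta_k + \alpha_{k+1}^2 B', \tag{A.10}$$

where we define

$$\delta_k = Q(\lambda^{(k)}) - Q(\lambda), \qquad B' = 4(1-\nu)L^2.$$

Note that (A.10) also holds for GMA-0 and GMA-0+ due to (A.8). Therefore
similarly to the proof of Theorem 4.2, we can solve the recursion in (A.10)
to obtain

$$\delta_k \le \frac{4B'}{k+3} = \frac{16(1-\nu)L^2}{k+3}.$$

That is, we have

$$Q(\lambda^{(k)}) \le \min_{\lambda \in \Lambda} Q(\lambda) + \frac{16(1-\nu)L^2}{k+3}.$$

We can thus apply Theorem 4.1 with $\varepsilon_V = 0$,
$\varepsilon = 16(1-\nu)L^2/(k+3)$, and $\theta = (1-2\nu)/(1-\nu)$ to
complete the proof.

APPENDIX B: DETAILED PERFORMANCE TABLE AND FIGURES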



Table 2
Regret of different algorithms: oracle model is superior to averaged models

              STAR            EXP             PROJ
              0.43 ± 0.41     0.386 ± 0.47    0.407 ± 0.28

              k = 1           k = 2           k = 5           k = 20          k = 40
GMA-0
$\nu$ = 0.5     0.508 ± 0.76    0.42 ± 0.53     0.358 ± 0.42    0.336 ± 0.38    0.332 ± 0.37
$\nu$ = 0.1     0.508 ± 0.76    0.523 ± 0.5     0.424 ± 0.35    0.394 ± 0.3     0.389 ± 0.3
$\nu$ = 0       0.508 ± 0.76    0.55 ± 0.48     0.444 ± 0.34    0.411 ± 0.29    0.409 ± 0.28
GMA-0+
$\nu$ = 0.5     0.508 ± 0.76    0.366 ± 0.5     0.341 ± 0.4     0.336 ± 0.38    0.336 ± 0.38
$\nu$ = 0.1     0.508 ± 0.76    0.387 ± 0.44    0.391 ± 0.33    0.394 ± 0.3     0.394 ± 0.3
$\nu$ = 0       0.508 ± 0.76    0.396 ± 0.43    0.403 ± 0.32    0.411 ± 0.29    0.411 ± 0.29
GMA-1
$\nu$ = 0.5     0.54 ± 0.79     0.683 ± 0.44    0.391 ± 0.38    0.342 ± 0.36    0.334 ± 0.37
$\nu$ = 0.1     0.58 ± 0.83     0.897 ± 0.35    0.49 ± 0.31     0.41 ± 0.29     0.399 ± 0.29
$\nu$ = 0       0.609 ± 0.84    0.937 ± 0.32    0.528 ± 0.3     0.428 ± 0.27    0.415 ± 0.28
GMA-1+
$\nu$ = 0.5     0.54 ± 0.79     0.381 ± 0.46    0.338 ± 0.38    0.336 ± 0.38    0.336 ± 0.38
$\nu$ = 0.1     0.58 ± 0.83     0.459 ± 0.45    0.4 ± 0.31      0.395 ± 0.3     0.395 ± 0.3
$\nu$ = 0       0.609 ± 0.84    0.488 ± 0.45    0.418 ± 0.3     0.411 ± 0.29    0.411 ± 0.29

[Figure 3: two panels, $\nu = 0.1$ and $\nu = 0$.]

Fig. 3. Regrets $R(\mathsf{f}_{\lambda^{(k)}})$ versus iterations $k$ of different greedy procedures at $\nu = 0.1$ and $\nu = 0$, under 500 replications.